Abstract. An implementation of the ASCEND/DESCEND types of vector
algorithms (like FFT) is adapted to the current model of the
"NYU-Ultracomputer", a shared memory asynchronous parallel processor,
including basic features of the proposed hardware and software.

to the present parallel computer model [UCN32], including the operating
system implementation proposed in [UCN13, UCSSN1].


2. The ASCEND/DESCEND types of vector algorithms.

The ASCEND/DESCEND algorithms under consideration were introduced
in [PreVui] as follows:

Let n = 2^k, where k is a nonnegative integer, and let n input data items
t_0, t_1, ..., t_{n-1} be stored, respectively, in storage locations
T[0], T[1], ..., T[n-1].

Informally, an ASCEND type algorithm is performed on data that are
successively 2^0, 2^1, ..., 2^{k-1} locations apart, whereas a DESCEND type
algorithm is performed on data that are successively 2^{k-1}, 2^{k-2}, ..., 2^0
locations apart.
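The stride pattern described informally above can be made concrete with a small sketch. It assumes the pairing rule that appears in procedure (2) of this paper (element m is combined with element m + 2^j when bit j of m is 0); the names `ascend`, `descend`, `oper` and `pair_strides` are illustrative and do not come from the original text.

```python
def ascend(T, oper):
    """Apply oper to pairs that are 2^0, 2^1, ..., 2^(k-1) locations apart."""
    n = len(T)
    k = n.bit_length() - 1
    assert n == 1 << k, "n must be a power of two"
    for j in range(k):
        for m in range(n):
            if (m >> j) & 1 == 0:  # bit_j(m) = 0
                T[m], T[m + (1 << j)] = oper(m, j, T[m], T[m + (1 << j)])
    return T


def descend(T, oper):
    """Same pairing rule, but with strides 2^(k-1), ..., 2^1, 2^0."""
    n = len(T)
    k = n.bit_length() - 1
    assert n == 1 << k, "n must be a power of two"
    for j in reversed(range(k)):
        for m in range(n):
            if (m >> j) & 1 == 0:
                T[m], T[m + (1 << j)] = oper(m, j, T[m], T[m + (1 << j)])
    return T


def pair_strides(n, runner):
    """Record the sequence of distinct strides visited by runner."""
    seen = []

    def oper(m, j, a, b):
        if not seen or seen[-1] != 1 << j:
            seen.append(1 << j)
        return a, b  # leave the data unchanged

    runner(list(range(n)), oper)
    return seen


ascend_strides = pair_strides(8, ascend)
descend_strides = pair_strides(8, descend)
```

For n = 8 the recorded strides are 1, 2, 4 for ASCEND and 4, 2, 1 for DESCEND.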

More precisely, an algorithm of the ASCEND type is one of the
following form:


3. The parallel computer model and statement of the problem.

Our parallel computer model consists of a fixed number of
identical processing elements (PEs) connected through a network to a
common memory (CM). In addition, each PE possesses its own private memory
(PM). Accesses to private memory have much less delay than accesses to
common memory. We assume that the parallel computer is provided with
an operating system that supports the following two system
primitives (see [UCN13, UCSSN1]):

Spawn (taskname, multiplicity, list of parameters);

Wait (taskname).

Let us briefly outline the semantics of these primitives. The
invocation of Spawn with the given parameters causes the insertion into
some queue of an item that contains (or is linked with) the task named
taskname and its parameters; this task must be executed (concurrently) a
number of times equal to the multiplicity. The invocation of Wait causes
the calling process to wait (or suspend) until all tasks named taskname
are completed (and the corresponding item is deleted from the queue).
The standard way of using these primitives is the following. A parent
process first Spawns a number of child processes (usually for the
execution of a large number of relatively independent computations
that in serial programs are implemented by loops). Then the parent
process calls Wait and thereby suspends itself until all the Spawned

To specify the case further, suppose that the whole parallel
computer is at the disposal of our problem. Furthermore, suppose that
the computer has four PEs and that they start simultaneously. We
number the PEs from 0 to 3. One possible distribution of the
computations between the PEs is given in the table below, where rows
represent j-stages (from 0 to k-1 = 3) of the computations and columns
represent m-coordinates (from 0 to n-1 = 15). We place the number of the
PE executing the computation OPER (m, j; T[m], T[m+2^j]) at the
intersection of the j-th row and the m-th column.

No PE can begin stage j = 2 until all computations at stages
j = 0 and j = 1 are complete, so there must be a busy-wait
synchronization between stages j = 1 and j = 2.


  m  ||  0 |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | 11 | 12 | 13 | 14 | 15
 j=0 ||  0 |  0 |  0 |  0 |  1 |  1 |  1 |  1 |  2 |  2 |  2 |  2 |  3 |  3 |  3 |  3
 j=1 ||  0 |  0 |  0 |  0 |  1 |  1 |  1 |  1 |  2 |  2 |  2 |  2 |  3 |  3 |  3 |  3
 j=2 ||  0 |  1 |  2 |  3 |  0 |  1 |  2 |  3 |  0 |  1 |  2 |  3 |  0 |  1 |  2 |  3
 j=3 ||  0 |  1 |  2 |  3 |  0 |  1 |  2 |  3 |  0 |  1 |  2 |  3 |  0 |  1 |  2 |  3

Table 4.1
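The assignment in Table 4.1 can be reproduced by a short rule. The sketch below assumes one reading of the table: for stages j = 0, 1 a PE owns a contiguous block of four columns, and for stages j = 2, 3 it owns the columns that agree modulo 4. The function name `pe_for` is illustrative. The sketch also checks that both operands of every OPER lie on the same PE within a stage.

```python
def pe_for(m, j):
    """PE number for OPER(m, j; ...) with n = 16 and four PEs."""
    if j < 2:
        return m >> 2  # stages 0 and 1: contiguous blocks of four columns
    return m & 3       # stages 2 and 3: columns that agree modulo 4


# One row per stage j, one entry per column m.
table = [[pe_for(m, j) for m in range(16)] for j in range(4)]

# T[m] and T[m + 2^j] must be handled by the same PE at stage j.
same_pe = all(
    pe_for(m, j) == pe_for(m + (1 << j), j)
    for j in range(4)
    for m in range(16)
    if (m >> j) & 1 == 0
)
```

Because a PE keeps the same columns throughout stages 0 and 1, and again throughout stages 2 and 3, only the transition between j = 1 and j = 2 needs a synchronization.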


Now we compare the presented distribution with the one resulting
from [UCN11, Krus]. The latter would suggest assigning columns
m = 4·numPE, 4·numPE+1, ..., 4·numPE+3 for execution by PE numPE
(numPE = 0, 1, 2, 3). And two busy-wait synchronizations are supposed

PEnum =   |    0    |    1    |    2    |    3    |    4    |    5    |
time = 0  | a_{0,0} | a_{0,1} | a_{0,2} | a_{0,3} |         |         |
time = 1  | a_{1,0} | a_{1,1} | a_{1,2} | a_{1,3} |         |         |
time = 2  | b_{0,0} | b_{0,1} | b_{0,2} | b_{0,3} |         |         |
time = 3  | b_{1,0} | b_{1,1} | b_{1,2} | b_{1,3} |         |         |

Table 4.2

Rows in this and the next table correspond to successive
time units, with one unit corresponding to two successive j-stages in
table 4.1. The empty entries correspond to busy-waiting periods. The
symbols a_{J,M} (J = 0,1; M = 0,1,2,3) in table 4.2 are used to represent
computations for problem I corresponding to row numbers 2·J, 2·J+1
and column numbers 4·M, 4·M+1, 4·M+2, 4·M+3 from table 4.1. An analogous
convention is adopted for the symbols b_{J,M} that represent problem II.


Our solution, represented in table 4.3, covers most of the
busy-waiting gap for PEs 4 and 5 and hence cuts (in this example) the
total computational time by 25%. The symbols A_{J,M} (J = 0,1;
M = 0,1,2,3) in table 4.3 are used to represent computations for
problem I corresponding to the entries M from the row numbers 2·J,
2·J+1 from table 4.1. An analogous convention is adopted for the symbols
B_{J,M} that represent problem II.

(J,M_1)-package and (J,M_2)-package are disjoint when M_1 ≠ M_2. Note also
that if R[m] belongs to a (J,M)-package, (J-1)p ≤ j ≤ Jp-1, and
bit_j(m) = 0, then R[m+2^j] also belongs to the same (J,M)-package.

Now, assuming p to be fixed, we define the operation
PACKOPER (M, J; T) as the ASCEND operation (1) performed on the
(J,M)-package of T only. Namely, the result of PACKOPER (M, J; T) is
given by the following code.

procedure PACKOPER (M, J; T)
   do j <- Jp to (J+1)p-1
      do for all m = S_J(M,0), S_J(M,1), ..., S_J(M,2^p-1)         } (2)
         if bit_j(m) = 0 then OPER (m, j; T[m], T[m+2^j])

Now we can rearrange computation (1) as follows:

procedure ASCEND
   do J <- 0 to r-1                                               } (3)
      do for all M, 0 ≤ M ≤ 2^{p(r-1)}-1
         PACKOPER (M, J; T)

One can easily see that (3) turns into (1) if p = 1. (In this case the
effects of the last two lines in (1) and (3) are identical.)
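Procedures (2) and (3) can be sketched in executable form as follows. The definition of S_J(M,i) is not reproduced in this portion of the text, so the index map `S` below is an assumption: it places the p bits of i at bit positions Jp, ..., (J+1)p-1 and spreads the bits of M over the remaining positions. The names `packoper`, `ascend_packed` and `ascend_plain`, and the extra parameters `p` and `oper`, are likewise illustrative.

```python
def S(J, M, i, p):
    """Assumed index map for the i-th element of the (J,M)-package."""
    low = M & ((1 << (J * p)) - 1)  # bits of M below the package window
    high = M >> (J * p)             # bits of M above the package window
    return (high << ((J + 1) * p)) | (i << (J * p)) | low


def packoper(M, J, T, p, oper):
    """Procedure (2): stages Jp..(J+1)p-1 restricted to one package."""
    for j in range(J * p, (J + 1) * p):
        for i in range(1 << p):
            m = S(J, M, i, p)
            if (m >> j) & 1 == 0:
                T[m], T[m + (1 << j)] = oper(m, j, T[m], T[m + (1 << j)])


def ascend_packed(T, p, oper):
    """Procedure (3): r groups of p stages, each split into packages."""
    k = len(T).bit_length() - 1
    assert len(T) == 1 << k and k % p == 0
    r = k // p
    for J in range(r):
        for M in range(1 << (p * (r - 1))):
            packoper(M, J, T, p, oper)
    return T


def ascend_plain(T, oper):
    """Stage-by-stage ASCEND, used here only as a reference."""
    k = len(T).bit_length() - 1
    for j in range(k):
        for m in range(len(T)):
            if (m >> j) & 1 == 0:
                T[m], T[m + (1 << j)] = oper(m, j, T[m], T[m + (1 << j)])
    return T


def sample_oper(m, j, a, b):
    # An arbitrary operation that depends on m and j.
    return a + (j + 1) * b, a - (m + 1) * b


reference = ascend_plain(list(range(16)), sample_oper)
packed = {p: ascend_packed(list(range(16)), p, sample_oper) for p in (1, 2, 4)}
```

With n = 16 and p = 2 the assumed map gives packages {4M, ..., 4M+3} for J = 0 and {M, M+4, M+8, M+12} for J = 1, in agreement with the PE assignment of Table 4.1.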

The representation (3) for the ASCEND algorithm can be easily
transformed into parallel form, using the Spawn and Wait primitives.
In this form the Spawned tasks are PACKOPER (M, J; T) for various

is a location in PM. If p (and hence a task) is large enough, a more
economical way of performing PACKOPER is to first store J and a
(J,M)-package of T into the private memory of a PE, then to perform the p
executions of the j-cycle of (2), and to finish the task by storing the
result back into the CM.
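The private-memory variant described above can be sketched with a thread pool standing in for the PEs. This is an illustration under assumptions, not the implementation used in the experiments: the package index map is the same assumed one (p bits of i at positions Jp, ..., (J+1)p-1), a local Python list plays the role of PM, the shared list T plays the role of CM, and waiting on the submitted futures plays the role of Wait. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def package_indices(J, M, p):
    """Assumed global indices of the (J,M)-package, in local order."""
    low = M & ((1 << (J * p)) - 1)
    high = M >> (J * p)
    base = (high << ((J + 1) * p)) | low
    return [base | (i << (J * p)) for i in range(1 << p)]


def packoper_private(M, J, T, p, oper):
    idx = package_indices(J, M, p)
    local = [T[m] for m in idx]        # fetch the package from "CM"
    for s in range(p):                 # local stage s is global stage Jp+s
        j = J * p + s
        for i in range(1 << p):
            if (i >> s) & 1 == 0:
                a, b = oper(idx[i], j, local[i], local[i + (1 << s)])
                local[i], local[i + (1 << s)] = a, b
    for m, value in zip(idx, local):   # store the result back into "CM"
        T[m] = value


def ascend_parallel(T, p, oper, num_pes=4):
    k = len(T).bit_length() - 1
    assert len(T) == 1 << k and k % p == 0
    r = k // p
    with ThreadPoolExecutor(max_workers=num_pes) as pool:
        for J in range(r):
            tasks = [pool.submit(packoper_private, M, J, T, p, oper)
                     for M in range(1 << (p * (r - 1)))]
            for t in tasks:            # plays the role of Wait
                t.result()
    return T


def ascend_serial(T, oper):
    """Stage-by-stage reference used to check the parallel result."""
    k = len(T).bit_length() - 1
    for j in range(k):
        for m in range(len(T)):
            if (m >> j) & 1 == 0:
                T[m], T[m + (1 << j)] = oper(m, j, T[m], T[m + (1 << j)])
    return T


def sample_oper(m, j, a, b):
    return a + (j + 1) * b, a - (m + 1) * b


expected = ascend_serial(list(range(16)), sample_oper)
result = ascend_parallel(list(range(16)), 2, sample_oper)
```

Each task touches the shared array only twice per element (one fetch and one store), while the p stages in between work on the local copy.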

Note that we have restricted the size of packages by the requirement
"p is an integer divisor of k", where p is the "depth" of packages and k is
the "depth" of the entire problem. This restriction can easily be
eliminated with straightforward modifications to the proposed
algorithm.


5. Experiments with the Fast Fourier Transform Algorithm.


After the bit-reversal permutation of the original array, FFT
becomes an ASCEND algorithm (1) with OPER (m, j; T[m], T[m+2^j]) of the
form

   T[m]      <-  T[m] + a(m,j)
   T[m+2^j]  <-  T[m] - a(m,j),

where a(m,j) is a power of ω with exponent proportional to low_j(m),
ω is a primitive n-th root of unity, and
low_j(m) = m (mod 2^j). This special form of OPER allows the well-known
optimization of precalculating these values by forming an array a(m,j),
m = 0,...,n-1; j = 1,...,k. If this array is allocated in CM and each

following way: the economy of using our algorithm is more significant
if one counts delays.
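The FFT viewed as an ASCEND algorithm, as described in this section, can be sketched as follows. The sketch uses the standard textbook radix-2 butterfly, in which the precalculated factor multiplies T[m+2^j] before the sum and difference are formed; this form, the sign convention ω = exp(-2πi/n), and the layout of the table `tw` (indexed by stage j = 0, ..., k-1 and by low_j(m)) are assumptions rather than a transcription of the formulas above.

```python
import cmath


def bit_reverse_permute(x):
    """Return x reordered so that element m moves to the bit-reversed index."""
    n = len(x)
    k = n.bit_length() - 1
    out = [0j] * n
    for m in range(n):
        rev = int(format(m, "0{}b".format(k))[::-1], 2) if k else 0
        out[rev] = x[m]
    return out


def fft_ascend(x):
    n = len(x)
    k = n.bit_length() - 1
    assert n == 1 << k, "n must be a power of two"
    T = bit_reverse_permute(x)
    # Precalculated factors: tw[j][low] = exp(-2*pi*i * low / 2^(j+1)).
    tw = [[cmath.exp(-2j * cmath.pi * low / (1 << (j + 1)))
           for low in range(1 << j)] for j in range(k)]
    for j in range(k):                 # ASCEND: strides 1, 2, ..., n/2
        half = 1 << j
        for m in range(n):
            if (m >> j) & 1 == 0:      # bit_j(m) = 0
                u = T[m]
                v = tw[j][m & (half - 1)] * T[m + half]  # low_j(m) = m mod 2^j
                T[m], T[m + half] = u + v, u - v
    return T


def dft(x):
    """Direct O(n^2) transform, used only to check the result."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


sample = [1, 2, 3, 4, 0, -1, -2, -3]
fast = fft_ascend(sample)
slow = dft(sample)
max_error = max(abs(a - b) for a, b in zip(fast, slow))
```

On the sample input the two transforms agree up to floating-point rounding.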